This study compares two statistical analyses performed on the same data set but with different programs. The original work by Vicencio (2019) was carried out using the JMP statistical package; this replication uses RStudio. The same data will be used and the same graphs will be reproduced, which will make it possible to compare the two sets of results and thus identify the benefits and disadvantages of each program.
The work of Vicencio (2019) sought, as a first approach to its final objective, to determine whether it is possible to distinguish different clusters within the same obsidian deposit. Vicencio collected a total of 334 obsidian samples from a 120 km² region, all related to the El Paredón deposit. To achieve this first objective, the author performed a series of PCA and k-means cluster analyses on the 334 samples. A prior geochemical analysis using X-ray fluorescence provided semi-quantitative concentrations, in parts per million (ppm), of ten elements: Mn, Fe, Zn, Ga, Th, Rb, Sr, Y, Zr, and Nb.
In his results, Vicencio identifies two main sub-sources, El Paredón and Tres Cabezas, both visible in the statistical analysis and in GIS.
PCA and cluster analysis will be carried out with the R program to compare the results of the two studies.
First of all, the libraries and the dataset must be loaded.
library(curl)
## Using libcurl 7.77.0 with LibreSSL/2.8.3
library(MASS)
library(ggplot2)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(car)
## Loading required package: carData
Load the data
f <- curl("https://raw.githubusercontent.com/gabovicen/gabovicen-data-replication-assignment/main/Replica_R.2.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
## FRX.Sample Site Mn Fe Zn Ga Th Rb Sr Y Zr Nb Cluster
## 1 PARM1A Sub-source 1 317 11090 60 21 17 168 -1 57 261 44 1
## 2 PARM1B Sub-source 1 341 10745 63 22 16 168 1 53 262 47 1
## 3 PARM1C Sub-source 1 510 11128 70 20 15 162 3 52 279 48 1
## 4 PARM8A Sub-source 1 496 11752 51 19 20 170 3 60 271 49 1
## 5 PARM8B Sub-source 1 538 11118 69 22 19 162 2 63 303 50 1
## 6 PARM8C Sub-source 1 694 12255 77 27 23 164 3 61 296 44 1
summary(d)
## FRX.Sample Site Mn Fe
## Length:334 Length:334 Min. : 75.0 Min. : 5120
## Class :character Class :character 1st Qu.: 336.0 1st Qu.: 8680
## Mode :character Mode :character Median : 381.5 Median :10067
## Mean : 389.9 Mean :10097
## 3rd Qu.: 429.0 3rd Qu.:10980
## Max. :1208.0 Max. :20821
## Zn Ga Th Rb
## Min. : 40.00 Min. :10.00 Min. : 9.00 Min. :113.0
## 1st Qu.: 58.00 1st Qu.:18.00 1st Qu.:15.00 1st Qu.:157.0
## Median : 65.00 Median :21.00 Median :17.00 Median :162.0
## Mean : 65.32 Mean :21.03 Mean :16.56 Mean :162.4
## 3rd Qu.: 72.00 3rd Qu.:23.00 3rd Qu.:18.00 3rd Qu.:168.0
## Max. :107.00 Max. :36.00 Max. :24.00 Max. :196.0
## Sr Y Zr Nb
## Min. :-1.000 Min. :39.00 Min. :138.0 Min. :34.00
## 1st Qu.: 2.000 1st Qu.:47.00 1st Qu.:191.0 1st Qu.:39.00
## Median : 3.000 Median :51.00 Median :203.5 Median :42.00
## Mean : 3.596 Mean :52.42 Mean :232.1 Mean :43.04
## 3rd Qu.: 5.000 3rd Qu.:59.00 3rd Qu.:280.0 3rd Qu.:47.00
## Max. :35.000 Max. :69.00 Max. :389.0 Max. :55.00
## Cluster
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :1.539
## 3rd Qu.:2.000
## Max. :2.000
str(d)
## 'data.frame': 334 obs. of 13 variables:
## $ FRX.Sample: chr "PARM1A" "PARM1B" "PARM1C" "PARM8A" ...
## $ Site : chr "Sub-source 1" "Sub-source 1" "Sub-source 1" "Sub-source 1" ...
## $ Mn : int 317 341 510 496 538 694 557 574 634 503 ...
## $ Fe : int 11090 10745 11128 11752 11118 12255 10498 10498 10958 10023 ...
## $ Zn : int 60 63 70 51 69 77 70 77 62 67 ...
## $ Ga : int 21 22 20 19 22 27 21 24 17 25 ...
## $ Th : int 17 16 15 20 19 23 22 20 24 20 ...
## $ Rb : int 168 168 162 170 162 164 157 160 163 149 ...
## $ Sr : int -1 1 3 3 2 3 3 3 3 2 ...
## $ Y : int 57 53 52 60 63 61 63 62 61 55 ...
## $ Zr : int 261 262 279 271 303 296 274 269 275 254 ...
## $ Nb : int 44 47 48 49 50 44 49 48 45 43 ...
## $ Cluster : int 1 1 1 1 1 1 1 1 1 1 ...
scatterplotMatrix(~ Fe + Zn + Ga + Th + Rb + Sr + Y + Zr + Nb, data = d, legend = TRUE, smooth = list(method = gamLine), diagonal = TRUE, plot.points = TRUE)
Figure 1. Scatter plot matrix from side B of the thesis (Vicencio 2019)
As can be seen from the two figures, the scatter plot analysis is better represented by the R output: the formation of two groups is already visible in this scatter plot distribution. The elements Y and Zr are the best representatives of the dispersion.
With the scatter plot matrix, Vicencio (2019) identifies specific elements that can reveal a better distribution pattern between the two sub-sources, being: Zn, Sr, Y, Zr, and Nb.
Let’s see how these elements are plotted.
d$Site <- as.factor(d$Site)
plot(d$Site, d$Zn, main = "Zn Plot", ylab = "Zn")
plot(d$Site, d$Sr, main = "Sr Plot", ylab = "Sr")
plot(d$Site, d$Y, main = "Y Plot", ylab = "Y")
plot(d$Site, d$Zr, main = "Zr Plot", ylab = "Zr")
plot(d$Site, d$Nb, main = "Nb Plot", ylab = "Nb")
##http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
Principal component analysis (PCA) allows us to summarize and to visualize the information in a data set containing individuals/observations described by multiple inter-correlated quantitative variables. Each variable could be considered as a different dimension.
Principal component analysis is used to extract the important information from a multivariate data table and to express this information as a set of a few new variables called principal components. These new variables correspond to a linear combination of the originals (Kassambara 2017).
Arguments for prcomp():
- x: a numeric matrix or data frame
- scale: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place

Arguments for princomp():
- x: a numeric matrix or data frame
- cor: a logical value; if TRUE, the data will be centered and scaled before the analysis
- scores: a logical value; if TRUE, the coordinates on each principal component are calculated
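As a quick check of the scale argument (a minimal sketch on synthetic values, not the obsidian measurements): running prcomp() with scale = TRUE is equivalent to standardizing the matrix first and then calling prcomp() without scaling.

```r
# Sketch with synthetic data: prcomp(x, scale = TRUE) matches
# prcomp() applied to an already-standardized matrix.
set.seed(1)
m <- matrix(rnorm(50), ncol = 5)

p1 <- prcomp(m, scale = TRUE)
p2 <- prcomp(scale(m), scale = FALSE)

all.equal(p1$sdev, p2$sdev)  # TRUE: identical component standard deviations
```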
library("factoextra")
# 'd' is already in memory, so no data() call is needed
XRF.elements <- d[1:12, 3:12]  # note: only the first 12 samples are used here
head(d[, 1:12])
## FRX.Sample Site Mn Fe Zn Ga Th Rb Sr Y Zr Nb
## 1 PARM1A Sub-source 1 317 11090 60 21 17 168 -1 57 261 44
## 2 PARM1B Sub-source 1 341 10745 63 22 16 168 1 53 262 47
## 3 PARM1C Sub-source 1 510 11128 70 20 15 162 3 52 279 48
## 4 PARM8A Sub-source 1 496 11752 51 19 20 170 3 60 271 49
## 5 PARM8B Sub-source 1 538 11118 69 22 19 162 2 63 303 50
## 6 PARM8C Sub-source 1 694 12255 77 27 23 164 3 61 296 44
res.pca <- prcomp(XRF.elements, scale = TRUE)
Visualize the eigenvalues (scree plot), showing the percentage of variance explained by each principal component.
fviz_eig(res.pca)
get_eig(res.pca)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.360459958 33.60459958 33.60460
## Dim.2 2.423465909 24.23465909 57.83926
## Dim.3 1.417447119 14.17447119 72.01373
## Dim.4 1.138139078 11.38139078 83.39512
## Dim.5 0.758386276 7.58386276 90.97898
## Dim.6 0.422621291 4.22621291 95.20520
## Dim.7 0.289927357 2.89927357 98.10447
## Dim.8 0.150518676 1.50518676 99.60966
## Dim.9 0.033703410 0.33703410 99.94669
## Dim.10 0.005330927 0.05330927 100.00000
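The columns reported by get_eig() can also be recovered by hand from prcomp()'s output, which clarifies what the table contains. A sketch with synthetic data (the real analysis would apply the same lines to res.pca):

```r
# Eigenvalues are the squared standard deviations of the components;
# variance percentages and their cumulative sum follow directly.
set.seed(1)
m <- matrix(rnorm(100), ncol = 4)
pca <- prcomp(m, scale = TRUE)

eig <- pca$sdev^2                       # eigenvalue column
variance.percent <- 100 * eig / sum(eig)
cumulative <- cumsum(variance.percent)  # last entry is 100
```

With scaled variables the eigenvalues sum to the number of variables, which is why each one can be read as "variance of one original variable's worth."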
fviz_pca_ind(res.pca,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
# Here we see that most of the elements (variables) point in the same direction.
fviz_pca_biplot(res.pca, repel = TRUE,
col.var = "#2E9FDF", # Variables color
col.ind = "#696969" # Individuals color
)
If we would like to predict where new samples fall within the two PCs:
ind.sup.coord <- predict(res.pca, newdata = XRF.elements)
ind.sup.coord[, 1:3]
## PC1 PC2 PC3
## 1 -3.1256292 -0.9426716 1.4899726
## 2 -2.7684170 -0.5050967 0.5879585
## 3 -0.8406120 -0.3369916 -0.1347065
## 4 -0.3593823 -2.7819518 -1.0415899
## 5 1.3899399 -1.2688419 -0.1038501
## 6 3.2428451 -0.2015854 2.7367338
## 7 1.4207343 0.1940572 -1.2691978
## 8 1.3857145 0.9610954 -0.1756749
## 9 1.1746540 -1.0463507 -0.9965159
## 10 -0.8018672 2.9618846 0.4690704
## 11 -0.4780815 2.1085518 -1.1980067
## 12 -0.2398986 0.8579007 -0.3641934
# Plot of active individuals
p <- fviz_pca_ind(res.pca, repel = TRUE)
# Add supplementary individuals
fviz_add(p, ind.sup.coord, color ="blue")
ind.scaled <- scale(XRF.elements,
center = res.pca$center,
scale = res.pca$scale)
# Coordinates of the individuals
coord_func <- function(ind, loadings){
r <- loadings*ind
apply(r, 2, sum)
}
pca.loadings <- res.pca$rotation
ind.sup.coord <- t(apply(ind.scaled, 1, coord_func, pca.loadings ))
ind.sup.coord[, 1:3]
## PC1 PC2 PC3
## 1 -3.1256292 -0.9426716 1.4899726
## 2 -2.7684170 -0.5050967 0.5879585
## 3 -0.8406120 -0.3369916 -0.1347065
## 4 -0.3593823 -2.7819518 -1.0415899
## 5 1.3899399 -1.2688419 -0.1038501
## 6 3.2428451 -0.2015854 2.7367338
## 7 1.4207343 0.1940572 -1.2691978
## 8 1.3857145 0.9610954 -0.1756749
## 9 1.1746540 -1.0463507 -0.9965159
## 10 -0.8018672 2.9618846 0.4690704
## 11 -0.4780815 2.1085518 -1.1980067
## 12 -0.2398986 0.8579007 -0.3641934
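The identical tables above are no accident: projecting the standardized data onto the loadings is exactly what predict() does for a prcomp object. A self-contained check with synthetic data (the chunk above does the same with XRF.elements):

```r
set.seed(2)
m <- matrix(rnorm(60), ncol = 6)
pca <- prcomp(m, scale = TRUE)

# Standardize with the PCA's centers/scales, then multiply by the loadings
manual <- scale(m, center = pca$center, scale = pca$scale) %*% pca$rotation

all.equal(unname(manual), unname(predict(pca, newdata = m)))  # TRUE
```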
Figure 2. Principal component analysis with the dispersion of two clusters, influenced by the elements Fe, Sr, Y, and Zr. Graph of the three main components, which cover 70% of the elemental variance (Vicencio 2019:Figure 39)
The representation of the principal component analysis in Vicencio's (2019) research was somewhat clearer, owing to how the data were displayed. Although the two analyses differ in which trace elements show the greatest influence, the way JMP presents the data seems more useful for visualizing sample distribution and clustering.
##AN/BI 588: Course Modules-Cluster Analysis
Cluster analysis uncovers subgroups of observations within a dataset by reducing a large number of observations to a smaller number of clusters. A cluster is a group of observations that are more similar to each other than they are to the observations in other groups (Arora et al.)
library(rattle.data)
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
##
## Attaching package: 'modeltools'
## The following object is masked from 'package:car':
##
## Predict
library(curl)
f <- curl("https://raw.githubusercontent.com/gabovicen/gabovicen-data-replication-assignment/main/Replica_R.2.csv")
clusterdata <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(clusterdata)
## FRX.Sample Site Mn Fe Zn Ga Th Rb Sr Y Zr Nb Cluster
## 1 PARM1A Sub-source 1 317 11090 60 21 17 168 -1 57 261 44 1
## 2 PARM1B Sub-source 1 341 10745 63 22 16 168 1 53 262 47 1
## 3 PARM1C Sub-source 1 510 11128 70 20 15 162 3 52 279 48 1
## 4 PARM8A Sub-source 1 496 11752 51 19 20 170 3 60 271 49 1
## 5 PARM8B Sub-source 1 538 11118 69 22 19 162 2 63 303 50 1
## 6 PARM8C Sub-source 1 694 12255 77 27 23 164 3 61 296 44 1
ds1<- scale(clusterdata[2:12][-1])
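Scaling first matters because k-means works on Euclidean distances: without it, Fe (on the order of 10,000 ppm) would swamp elements measured in tens of ppm. A minimal sketch of what scale() produces (illustrative values, not the real measurements):

```r
# scale() standardizes each column to mean 0 and standard deviation 1,
# putting high-magnitude and low-magnitude elements on an equal footing
m <- cbind(Fe = c(9000, 10000, 11000), Nb = c(40, 43, 46))
z <- scale(m)

colMeans(z)      # both ~0
apply(z, 2, sd)  # both 1
```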
As a first glimpse, we can plot the quantitative data to see a general pattern for the cluster analysis.
plot(ds1)
plot1 <- function(data = ds1, nc=6, seed=123456){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
}
This will help us determine how many clusters we want. Nevertheless, since Vicencio's analysis centers on two, we can use this step simply as a reference for other options.
plot1(ds1)
The graph we have plotted above has a bend at two, which suggests we should use two clusters, as the original analysis suggested.
This graph is especially useful if you know nothing about your data set.
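The quantity plot1() accumulates is kmeans()'s tot.withinss; the "bend" appears because the first split produces a far larger drop than later ones. A sketch on synthetic data with two obvious groups:

```r
set.seed(123)
m <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# Within-groups sum of squares for k = 1..4
wss <- sapply(1:4, function(k) kmeans(m, centers = k, nstart = 10)$tot.withinss)
# The drop from k = 1 to k = 2 dwarfs the later ones: the elbow is at 2
```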
library(NbClust)
set.seed(1234)
devAskNewPage(ask=TRUE)
nc1<-NbClust(ds1, min.nc=2, max.nc=6, method="kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 15 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 6 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
If we want to know the distribution of the 334 samples within the two clusters, we can use this function.
set.seed(12)
fit.kmseed<-kmeans(ds1, 2, nstart=25) # performs the k-means cluster analysis
fit.kmseed$size
## [1] 154 180
In this case, 154 samples are assigned to one cluster, while the remaining 180 fall into the second, matching the results shown in Vicencio's research.
Figure 3. K-means analysis, side B (Vicencio 2019)
fit.kmseed$centers
## Mn Fe Zn Ga Th Rb Sr
## 1 0.01766557 0.6229535 0.6207665 0.3548979 0.2175302 0.5352394 -0.5906522
## 2 -0.01511388 -0.5329713 -0.5311002 -0.3036349 -0.1861091 -0.4579270 0.5053357
## Y Zr Nb
## 1 0.9660370 0.9696201 0.9355399
## 2 -0.8264983 -0.8295639 -0.8004063
# cross-tabulation of type and cluster membership:
ct.kmseed<-table(clusterdata$Site, fit.kmseed$cluster)
ct.kmseed
##
## 1 2
## Sub-source 1 154 0
## Sub-source 2 0 180
library(flexclust)
randIndex(ct.kmseed)
## ARI
## 1
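An ARI of 1 means the k-means partition and Vicencio's sub-source labels agree on every pair of samples. flexclust's randIndex() computes the chance-adjusted version; the underlying idea can be sketched with a small base-R helper (a hypothetical illustration, not the flexclust implementation):

```r
# Fraction of sample pairs on which two labelings agree
# (both grouped together, or both apart); 1 means identical partitions
rand_index <- function(a, b) {
  pairs <- combn(length(a), 2)
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]
  mean(same_a == same_b)
}

rand_index(c(1, 1, 2, 2), c("x", "x", "y", "y"))  # identical partitions -> 1
```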
fit.kmseed$cluster # these are each of the clusters
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [334] 2
library(cluster)
clusplot(ds1, fit.kmseed$cluster, main='2D Representation of the Cluster Solution',
color=TRUE, shade=TRUE, plotchar = TRUE, labels=2, lines=0)
ds2<- scale(clusterdata[10:12][-1])
As a first glimpse, we can plot the quantitative data to see a general pattern for the cluster analysis.
plot(ds2)
plot1 <- function(data = ds2, nc=6, seed=123456){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
}
plot1(ds2)
library(NbClust)
set.seed(1234)
devAskNewPage(ask=TRUE)
nc1<-NbClust(ds2, min.nc=2, max.nc=6, method="kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 15 proposed 2 as the best number of clusters
## * 1 proposed 3 as the best number of clusters
## * 8 proposed 4 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
set.seed(12)
fit.kmseed<-kmeans(ds2, 2, nstart=25) # performs the k-means cluster analysis
fit.kmseed$size
## [1] 154 180
The sample distribution is the same as with ds1 (10 elements).
fit.kmseed$centers
## Zr Nb
## 1 1.0094837 0.9355399
## 2 -0.8636694 -0.8004063
# cross-tabulation of type and cluster membership:
ct.kmseed<-table(clusterdata$Site, fit.kmseed$cluster)
ct.kmseed
##
## 1 2
## Sub-source 1 149 5
## Sub-source 2 5 175
library(flexclust)
randIndex(ct.kmseed)
## ARI
## 0.8834752
fit.kmseed$cluster # these are each of the clusters
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [334] 2
library(cluster)
clusplot(ds2, fit.kmseed$cluster, main='2D Representation of the Cluster Solution',
color=TRUE, shade=TRUE, plotchar = TRUE, labels=2, lines=0)
It is clear that with Zr and Nb the clusters are better defined.
This result is close to what we see in the original work: using the trace elements with the greatest influence on the distribution and conglomeration of groups provides a better result.
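One way to quantify "better defined" is the average silhouette width from the cluster package used above: values near 1 mean each sample sits far from the neighboring cluster. A sketch on synthetic, well-separated data (the real check would pass ds2 and fit.kmseed$cluster):

```r
library(cluster)  # ships with R as a recommended package

set.seed(12)
m <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2))
km <- kmeans(m, centers = 2, nstart = 25)

# Average silhouette width across all points
sil <- silhouette(km$cluster, dist(m))
mean(sil[, "sil_width"])  # near 1 for well-separated clusters
```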
Figure 4. Bivariate graph of the two main components, with clustering ellipses at 95 percent probability, made from the two trace elements with the greatest variation: Sr and Zr (Vicencio 2019:Figure 44)
ds3<- scale(clusterdata[10:12][-1])
set.seed(12)
fit.kmseed<-kmeans(ds3, 3, nstart=25) # performs the k-means cluster analysis
fit.kmseed$size
## [1] 178 67 89
fit.kmseed$centers
## Zr Nb
## 1 -0.8723587 -0.8104284
## 2 0.7753195 0.4083707
## 3 1.1610499 1.3134318
# cross-tabulation of type and cluster membership:
ct.kmseed<-table(clusterdata$Site, fit.kmseed$cluster)
ct.kmseed
##
## 1 2 3
## Sub-source 1 3 63 88
## Sub-source 2 175 4 1
library(flexclust)
randIndex(ct.kmseed)
## ARI
## 0.7217842
fit.kmseed$cluster # these are each of the clusters
## [1] 2 2 3 3 3 2 3 3 2 2 2 2 2 3 2 2 3 2 2 3 3 2 3 3 3 3 3 3 2 3 3 3 3 1 1 3 3
## [38] 3 3 2 3 3 3 2 2 2 3 3 3 3 2 2 2 3 3 3 3 2 2 2 2 2 3 3 2 3 3 3 3 3 2 3 2 3
## [75] 3 2 2 2 3 3 1 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3 2 3 3 3 3 2 3 2 3 2 2 2 3 3 2
## [112] 3 3 3 2 2 2 3 2 3 3 3 3 3 2 2 2 3 3 2 2 3 2 2 2 3 2 3 2 2 2 3 2 3 2 3 3 3
## [149] 3 2 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [334] 1
library(cluster)
clusplot(ds3, fit.kmseed$cluster, main='2D Representation of the Cluster Solution',
color=TRUE, shade=TRUE, plotchar = TRUE, labels=2, lines=0)
Even if we try to find three clusters, only two are representative.
As shown in the work by Vicencio (2019), it is possible to identify at least two clusters from the distribution of the 334 geological samples from the El Paredón deposit. I found both statistical programs extremely useful for representing these data, although JMP displays some of the results more clearly. The shortcomings in my R representations stem from my own abilities with the program, and, from what we have seen during class, I am sure that R gives us as researchers more control over the statistical analysis of our work.
Arora, Aarti, Andrew Mark, Natalie Robinson, Audrey Tjahjadi (with modifications by Christopher A. Schmitt) Module 25: Cluster Analysis. AN/BI 588: Course Modules. https://fuzzyatelin.github.io/bioanth-stats/module-25/module-25.html
Kassambara, Alboukadel 2017 Principal Component Analysis in R: prcomp vs princomp. http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/#general-methods-for-principal-component-analysis
Vicencio, A. Gabriel 2019 El Paredón y Tlaxcala: un estudio regional de un yacimiento de obsidiana durante el Formativo Medio y el Formativo Tardío en Tlaxcala. Unpublished Master's thesis, Universidad Nacional Autónoma de México, Mexico City.